Dominant speaker detection based on voicing for adaptive audio-visual ASR robust to speech noise
Abstract
We investigate the use of voicing in state-of-the-art Audio-Visual Large Vocabulary Continuous Speech Recognition (AV-LVCSR). We apply an original adaptive weighting function that uses the voicing level to estimate appropriate combination weights for each modality. We show that state-of-the-art AV-LVCSR performance under speech noise can be improved by using a dominant-speaker detector that is a function of the voicing level. We refine the weighting function according to the sensitivity and specificity of the dominant-speaker detector. In this first experiment, the weighting functions are threshold functions of the voicing level. Rather than testing all possible thresholds, we arbitrarily choose three: one at which the sensitivity of the detector reaches 95%, one at which its specificity reaches 95%, and one at which sensitivity and specificity are equal. Results show that the AV-LVCSR system we use is improved by 5.7% when using a weighting function with high sensitivity to dominant-speaker activity.
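The threshold selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes frame-level voicing values in a NumPy array and a boolean reference labelling of dominant-speaker frames. The function names, the rule "detect when voicing is at or above the threshold", and the audio weights 0.7 and 0.3 are all hypothetical and chosen only for illustration.

```python
import numpy as np

def sensitivity_specificity(voicing, is_dominant, threshold):
    """Score a detector that fires when voicing >= threshold (assumed rule)."""
    detected = voicing >= threshold
    tp = np.sum(detected & is_dominant)
    fn = np.sum(~detected & is_dominant)
    tn = np.sum(~detected & ~is_dominant)
    fp = np.sum(detected & ~is_dominant)
    return tp / (tp + fn), tn / (tn + fp)

def pick_thresholds(voicing, is_dominant, target=0.95):
    """Return three operating points: high sensitivity, high specificity, equal."""
    # Candidate thresholds: every observed voicing value, plus +inf so that
    # a "never detect" operating point (specificity 1) always exists.
    candidates = np.append(np.unique(voicing), np.inf)
    stats = np.array([sensitivity_specificity(voicing, is_dominant, t)
                      for t in candidates])
    sens, spec = stats[:, 0], stats[:, 1]
    return {
        # Sensitivity falls as the threshold rises: take the highest
        # threshold that still reaches the target.
        "high_sensitivity": candidates[sens >= target].max(),
        # Specificity rises with the threshold: take the lowest one
        # that reaches the target.
        "high_specificity": candidates[spec >= target].min(),
        # Point where sensitivity and specificity are closest to equal.
        "equal": candidates[np.argmin(np.abs(sens - spec))],
    }

def audio_weight(voicing_level, threshold, w_dominant=0.7, w_other=0.3):
    """Threshold weighting function; both weight values are hypothetical."""
    return w_dominant if voicing_level >= threshold else w_other

# Synthetic data: dominant-speaker frames tend to have higher voicing.
voicing = np.concatenate([np.linspace(0.4, 1.0, 100), np.linspace(0.0, 0.6, 100)])
is_dominant = np.concatenate([np.ones(100, dtype=bool), np.zeros(100, dtype=bool)])
thresholds = pick_thresholds(voicing, is_dominant)
```

A lower threshold gives a detector that rarely misses dominant-speaker activity at the cost of more false alarms, which corresponds to the high-sensitivity operating point the abstract reports as most effective.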
Similar resources
An Improvement in Audio-Visual Voice Activity Detection for Automatic Speech Recognition
Noise-robust Automatic Speech Recognition (ASR) is essential for robots that are expected to communicate with humans in everyday environments. In such environments, Voice Activity Detection (VAD) strongly affects ASR performance because there is considerable acoustic and visual noise. In this paper, we improved Audio-Visual VAD for our two-layered audio visual integration framework for ...
Automatic speechreading of impaired speech
We investigate the use of visual, mouth-region information in improving automatic speech recognition (ASR) of the speech impaired. Given the video of an utterance by such a subject, we first extract appearance-based visual features from the mouth region-of-interest, and we use a feature fusion method to combine them with the subject’s audio features into bimodal observations. Subsequently, we a...
Two-layered audio-visual integration in voice activity detection and automatic speech recognition for robots
Automatic Speech Recognition (ASR), which plays an important role in human-robot interaction, should be noise-robust because robots are expected to work in noisy environments. Audio-Visual (AV) integration is one of the key ideas for improving robustness in such environments. This paper proposes two-layered AV integration for ASR which applies AV integration to Voice Activity Detection (VAD) and...
A study of voice activity detection techniques for NIST speaker recognition evaluations
Since 2008, interview-style speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has lower signal-to-noise ratio, which necessitates robust voice activity detectors (VADs). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/non-speech seg...
Real time robust speech detection for text independent speaker recognition
Speaker recognition systems employ a speech detection algorithm and use only frames detected as speech for further processing. The accuracy obtained by a speaker recognition system depends on the method that is used to detect speech, in particular for real-life deployments where the incoming speech varies significantly in loudness and noise characteristics. Also, actual deployments mandate real...